Load dataset (skipped for runtime reasons)

Two different versions of the count matrix exist: the one from GEO and the one created by us personally from cell ranger

Decided to go with cellranger matrix for analysis (probably no big difference in the end)

## An object of class Seurat 
## 19764 features across 4004 samples within 1 assay 
## Active assay: RNA (19764 features, 0 variable features)
##  1 layer present: counts

Quality control

## An object of class Seurat 
## 19764 features across 2313 samples within 1 assay 
## Active assay: RNA (19764 features, 0 variable features)
##  1 layer present: counts

Normalize data

Now testing scran Other tested options: logNormalize, scTransform

Identify highly variable features

## Finding variable features for layer counts
## When using repel, set xnudge and ynudge to 0 for optimal results

Scaling the data

## Centering and scaling data matrix

PCA

## Find clusters

## Computing nearest neighbor graph
## Computing SNN
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2313
## Number of edges: 72170
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8372
## Number of communities: 9
## Elapsed time: 0 seconds

Plot UMAP

## 14:43:14 UMAP embedding parameters a = 0.9922 b = 1.112
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 14:43:14 Read 2313 rows and found 10 numeric columns
## 14:43:14 Using Annoy for neighbor search, n_neighbors = 30
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## 14:43:14 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 14:43:14 Writing NN index file to temp file /var/folders/p4/zn4sp81s5gq35dzmr0nknsjc0000gn/T//RtmpPrcBwt/file37c6520f3bb2
## 14:43:14 Searching Annoy index using 1 thread, search_k = 3000
## 14:43:15 Annoy recall = 100%
## 14:43:15 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 14:43:16 Initializing from normalized Laplacian + noise (using RSpectra)
## 14:43:16 Commencing optimization for 500 epochs, with 87192 positive edges
## 14:43:19 Optimization finished

Check known cancer marker genes

Specific MM marker genes: B2M, CCND1, Immunoglobulins (IGHM, IGKC)

Source: https://www.cancer.gov/about-cancer/diagnosis-staging/diagnosis/tumor-markers-list

Additional marker used in the Numbat paper for MM: MZB1

Additional marker used in the Casper paper for MM: SCD1, CD38

Annotate based on the cancer marker genes

##          
##           control   MM undefined
##   control     257    4       101
##   MM            0 1949         2
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session
##                  p_val avg_log2FC pct.1 pct.2     p_val_adj
## RPS4Y1    0.000000e+00   7.873292 0.928 0.025  0.000000e+00
## HLA-DPB1  0.000000e+00   8.670499 0.804 0.025  0.000000e+00
## PTPRC     0.000000e+00   7.679161 0.801 0.024  0.000000e+00
## HLA-DPA1  0.000000e+00   8.760905 0.779 0.021  0.000000e+00
## HLA-DRB1 3.669508e-286   8.624379 0.674 0.014 7.252415e-282
## CD99     1.507533e-284   8.114201 0.666 0.013 2.979488e-280
## [1] 3488

Check which cancer marker genes are within the DE genes:

##              p_val avg_log2FC pct.1 pct.2     p_val_adj
## B2M   7.780115e-08 -0.1736831 1.000 1.000  1.537662e-03
## IGHM  7.730073e-28  0.2699701 0.544 0.998  1.527772e-23
## MZB1 1.272309e-140 -2.9218435 0.304 0.964 2.514592e-136
## CD38  7.070348e-21 -0.4336177 0.204 0.569  1.397384e-16

Plot the top positive and negative markers:

Compare annotation with results for SCEVAN and copyKat

print(table(seurat_obj$large_clusters,seurat_obj$copykat_pred))
##          
##           aneuploid diploid not.defined
##   control        14     211          36
##   MM           1943       1           5
print(table(seurat_obj$old_annot,seurat_obj$copykat_pred))
##            
##             aneuploid diploid not.defined
##   control          13     208          36
##   MM             1944       4           5
##   undefined         0       0           0
print(table(seurat_obj$large_clusters,seurat_obj$scevan_pred))
##          
##           filtered normal tumor
##   control       21    226    14
##   MM             4      1  1944
print(table(seurat_obj$old_annot,seurat_obj$scevan_pred))
##            
##             filtered normal tumor
##   control         21    223    13
##   MM               4      4  1945
##   undefined        0      0     0
print(table(seurat_obj$copykat_pred,seurat_obj$scevan_pred))
##              
##               filtered normal tumor
##   aneuploid          0      0  1957
##   diploid            0    212     0
##   not.defined       25     15     1